DISTRIBUTED CONTROL OF DATA FLOW 
IN A NETWORK SWITCH 

FIELD OF THE INVENTION 

The invention relates to network switches. More specifically, the invention 
relates to distributed control of data flow in network switches. 

BACKGROUND OF THE INVENTION 

In high bandwidth networks such as fiber optic networks, lower bandwidth 
services such as voice communications are aggregated and carried over a single fiber 
optic link. However, because the aggregated data can have different destinations, some 
mechanism for switching the aggregated components is required. Switching can be 
performed at different levels of aggregation. 

Current switching is accomplished in a synchronous manner. Signals are 
routed to a cross-connect or similar switching device that switches and routes signals at 
some predetermined granularity level, for example, byte by byte. Synchronous 
switching in a cross-connect is a logically straightforward method for switching. 
However, because data flow between network nodes is not necessarily consistent, 
switching bandwidth may not be used optimally in a synchronous cross-connect. One 
source of data may use all available bandwidth while a second source of data may 
transmit data sporadically. 

In order to support data sources that transmit at or near peak bandwidth, 
cross-connects are designed to provide the peak bandwidth to all data sources because 
specific data rates of specific data sources are not known when the cross-connect is 
designed. As a result, all data paths through the cross-connect provide the peak 
bandwidth, which may not be consumed by some or even most of the data sources. 

A further disadvantage of synchronous switching architectures is that 
centralized switching control and interconnections grow exponentially as the 
input/output paths grow. Therefore, large switching architectures are complex and 
require complex control algorithms and techniques. 

SUMMARY OF THE INVENTION 

A network switch is described. The network switch includes ingress cards to 
receive data from sources external to the switch and egress cards to transmit data to 
devices external to the switch. The ingress cards have an ingress buffer to temporarily 
store data, an ingress scheduler coupled to the ingress buffer, and a set of ports coupled 
to the ingress scheduler. The ingress scheduler reads data from the ingress buffer and 
selectively transfers the data to one of the set of ports. The egress cards have a set of 
ports coupled to receive data from respective ingress card ports. The egress cards also 
have an egress buffer coupled to the set of egress card ports. The egress buffer 
selectively reads data from the ports and stores the data. An egress scheduler is 
coupled to the egress buffer. The egress scheduler reads data from the egress buffer 
and transmits data to the external devices. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention is illustrated by way of example, and not by way of limitation, in 
the figures of the accompanying drawings in which like reference numerals refer to 
similar elements. 

Figure 1 illustrates one embodiment of a network architecture having multiple 
network switches. 

Figure 2 illustrates one embodiment of an interconnection of cards within a 
network switch. 

Figure 3 conceptually illustrates one embodiment of an ingress scheduler. 

Figure 4 conceptually illustrates one embodiment of egress cache scheduling 
of egress card ports. 

DETAILED DESCRIPTION 

Techniques for distributed control of data flow in a network switch are 
described. In the following description, for purposes of explanation, numerous specific 
details are set forth in order to provide a thorough understanding of the invention. It 
will be apparent, however, to one skilled in the art that the invention can be practiced 
without these specific details. In other instances, structures and devices are shown in 
block diagram form in order to avoid obscuring the invention. 

Reference in the specification to "one embodiment" or "an embodiment" 
means that a particular feature, structure, or characteristic described in connection with 
the embodiment is included in at least one embodiment of the invention. The 
appearances of the phrase "in one embodiment" in various places in the specification 
are not necessarily all referring to the same embodiment. 

The network switch described herein provides a cell/packet switching 
architecture that switches between line interface cards across a meshed backplane. In 
one embodiment, the switching can be accomplished at, or near, line speed in a 
protocol independent manner. The protocol independent switching provides support 
for various applications including, but not limited to, Asynchronous Transfer Mode 
(ATM) switching, Internet Protocol (IP) switching, Multiprotocol Label Switching 
(MPLS) switching, Ethernet switching and frame relay switching. The architecture 
allows the network switch to provision service on a per port basis. 

In one embodiment, the network switch provides a non-blocking topology 
with both input and output queuing and per flow queuing at both ingress and egress. 
Per flow flow-control can be provided between egress and ingress scheduling. Strict 
priority, round robin, weighted round robin and earliest deadline first scheduling can be 
provided. In one embodiment, cell/packet discard is provided only at the ingress side 
of the switch. In one embodiment, early packet discard (EPD), partial packet discard 
(PPD) and random early discard (RED) are provided. 

Figure 1 illustrates one embodiment of a network architecture having 
multiple network switches. While the switches of Figure 1 are illustrated as coupled to 
only router/hosts and networks, any type of device that generates and/or receives data 
that can be carried by a wide area network can be used. The router/hosts and networks 
are intended to illustrate data devices and statistical multiplexing devices. 

Switch 110 is coupled to router/host 100, network 102, network 104 and 
router/host 104. Any number of devices can be coupled to switch 110 in any manner 
known in the art. Similarly, router/host 122, network 124, network 126 and router/host 
128 are coupled to switch 120. Any number of devices can be coupled to switch 120 in 
any manner known in the art. 

Switches 110 and 120 are coupled to switch 130. Switches 110 and 120 can 
also be coupled to other switches or other network devices (not shown in Figure 1). 
Switch 130 is also coupled to network 140, which can include any type and any 
number of network elements including additional switches. 

Switches 110, 120 and 130 receive data from multiple devices including 
router/hosts, local area networks and other switches. The switches can aggregate 
multiple data sources into a single data stream. Statistical aggregation allows multiple 
sources of packet/cell data to share a link or port. For example, 24 sources, each with 
sustained bandwidth of 64 kbps could share a DS1 (1.544 Mbps) link. Statistical 
aggregation allows sources of data to burst to bandwidth higher than their sustained 
rate, based on availability of bandwidth capacity of the link. Similarly, STS-1 (51.840 
Mbps) signals from three networks can be received and combined into an OC-3 
(155.520 Mbps) signal. The OC-3 signal can be transmitted to another switch for 
routing and/or further aggregation. 
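
The rates quoted above can be checked with a short calculation. The following Python sketch is illustrative only; the helper name `fits_on_link` and the constant names are assumptions introduced here and are not part of the described switch.

```python
# Illustrative check of the aggregation arithmetic; all names are hypothetical.

DS1_KBPS = 1544      # DS1 line rate, 1.544 Mbps
STS1_KBPS = 51840    # STS-1 line rate, 51.840 Mbps
OC3_KBPS = 155520    # OC-3 line rate, 155.520 Mbps


def fits_on_link(source_rates_kbps, link_kbps):
    """Return True if the summed sustained rates do not exceed the link rate."""
    return sum(source_rates_kbps) <= link_kbps


voice_sources = [64] * 24                       # 24 sources at 64 kbps sustained
print(sum(voice_sources))                       # 1536
print(fits_on_link(voice_sources, DS1_KBPS))    # True
print(3 * STS1_KBPS == OC3_KBPS)                # True
```

The sustained rates sum to 1.536 Mbps, which fits within the 1.544 Mbps DS1 rate, and three STS-1 signals exactly fill an OC-3. A source can burst above its sustained rate only while the sum of the instantaneous rates stays within the link rate.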

In one embodiment, the switches of Figure 1 include multiple cards that are 
interconnected by a switching fabric. In one embodiment, the cards have both an 
ingress data path and an egress data path. 

The ingress data path is used to receive data from the network and transmit 
the data to an appropriate card within the switch. The ingress data path schedules 
transmission of data across the switching fabric. 

The egress data path is used to receive data from the switching fabric and 
transmit the data across the network. The egress data path schedules transmission of 
data out of the switch across the network. The ingress and egress data paths interact to 
prevent overflow of data within the network switch. 

Figure 2 illustrates one embodiment of an interconnection of cards within a 
network switch. The switch of Figure 2 can be, for example, any of switches 110, 120 
or 130 of Figure 1. The switch of Figure 2 is illustrated with four ingress cards and 
four egress cards for reasons of simplicity only. A switch can have any number of 
ingress cards and any number of egress cards. Also, data flow can be bi-directional. 
That is, the cards can also provide both egress and ingress functionality. 

Each ingress card includes an ingress buffer that receives data from an 
external source (not shown in Figure 2). Data can be in any format, for example, IP 
packets or ATM cells. The ingress buffers are coupled to ingress schedulers. The 
ingress schedulers dispatch data to egress cards via a set of ingress ports. In one 
embodiment, each ingress card has a port for each egress card to which the ingress card 
is coupled. For example, ingress card 0 is coupled to egress card 0 through port 0, to 
egress card 1 through port 1, to egress card 2 through port 2, and to egress card 3 
through port 3. 

In one embodiment in which each ingress card is coupled to each egress card, the 
interconnection between the ingress cards and the egress cards has n^2 connections, 
where n is the number of ingress/egress cards. Thus, the interconnection is referred to 
as an "n^2 mesh," or an "n^2 switching fabric." In another embodiment, the number of 
ingress cards is not equal to the number of egress cards, which is referred to as an "n x m 
mesh." The mesh is described in greater detail in U.S. Patent application number 
09/746,212, entitled "A FULL MESH INTERCONNECT BACKPLANE 
ARCHITECTURE," filed December 22, 2000, which is assigned to the corporate 
assignee of the present application and incorporated by reference. 

In one embodiment, traffic crosses the mesh, or switching fabric, in an 
asynchronous manner in that no central clock signal drives data across the mesh. Data 
is transmitted by the ingress cards without reference to a bus or mesh clock or frame 
synchronization signal. A protocol for use in communicating over the mesh is 
described in greater detail in U.S. Patent application number 09/745,982, entitled "A 
BACKPLANE PROTOCOL," filed December 22, 2000, which is assigned to the 
corporate assignee of the present invention and incorporated by reference. 

Each egress card includes a port for each ingress card to which the egress 
card is coupled. For example, egress card 0 is coupled to ingress card 0 through port 0, 
to ingress card 1 through port 1, to ingress card 2 through port 2, and to ingress card 3 
through port 3. The ports of the egress card are coupled to an egress buffer. The 
egress buffer is coupled to an egress scheduler that outputs data to a device external to 
the egress card (not shown in Figure 2). 
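
The port numbering in the examples above can be enumerated programmatically. The following Python sketch is a minimal illustration, assuming that the port index on each side equals the index of the card at the far end of the link, as in the four-card example; the function name `mesh_links` is hypothetical.

```python
def mesh_links(n_ingress, m_egress):
    """List (ingress_card, ingress_port, egress_card, egress_port) for a full mesh.

    Assumption: the port index on each card equals the index of the card at
    the far end of the link, following the four-card example in the text.
    """
    return [(i, e, e, i) for i in range(n_ingress) for e in range(m_egress)]


links = mesh_links(4, 4)
print(len(links))    # 16
print(links[2])      # (0, 2, 2, 0)
```

With four ingress cards and four egress cards the sketch yields 16 links, and the third entry shows ingress card 0 reaching egress card 2 through its port 2, arriving on port 0 of the egress card. Calling `mesh_links(n, m)` with unequal arguments gives the n x m case.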

The architecture illustrated in Figure 2 allows scheduling duties to be 
distributed between ingress and egress cards. Because the scheduling duties are 
distributed, a centralized scheduler is not required and transmission of data between 
ingress cards and egress cards can be accomplished in an asynchronous manner. This 
allows simpler control of data switching and more efficient use of switching fabric 
bandwidth. 

When data is received by an ingress card, the data is temporarily stored in 
the ingress buffer on the card. The ingress scheduler extracts data from the ingress 
buffer and sends the data to the appropriate port. For example, data to be transmitted to 
egress card 2 is sent to port 2. In one embodiment, the ingress scheduler reads and 
sends data according to an earliest deadline first scheduling scheme. In alternate 
embodiments, strict priority scheduling, round robin scheduling, weighted round robin 
scheduling, or other scheduling techniques can be used. 
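
Earliest deadline first dispatch can be sketched with a priority queue. The Python example below is a simplified illustration and not the scheduler of the described embodiment; the class name, the integer deadlines and the tie-breaking by arrival order are all assumptions made for the sketch.

```python
import heapq
import itertools


class EdfIngressScheduler:
    """Toy earliest-deadline-first scheduler; all names are hypothetical."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: equal deadlines keep arrival order

    def enqueue(self, deadline, egress_card, payload):
        """Queue a payload destined for the given egress card."""
        heapq.heappush(self._heap, (deadline, next(self._seq), egress_card, payload))

    def dispatch(self):
        """Pop the entry with the earliest deadline and return (port, payload)."""
        if not self._heap:
            return None
        _deadline, _seq, egress_card, payload = heapq.heappop(self._heap)
        return egress_card, payload     # assumed: port number equals egress card number


sched = EdfIngressScheduler()
sched.enqueue(30, 2, "cell-a")
sched.enqueue(10, 0, "cell-b")
sched.enqueue(20, 2, "cell-c")
order = [sched.dispatch() for _ in range(3)]
print(order)   # [(0, 'cell-b'), (2, 'cell-c'), (2, 'cell-a')]
```

Replacing the deadline key with a fixed priority value would turn the same structure into a strict priority scheduler.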

Data that is transferred between ingress and egress cards can be variable in 
size. The data can be transmitted as a group of fixed length cells or as one or more 
variable length packets. In one embodiment, the packets on the ingress side compete 
with each other on a packet basis. Each packet competes against the other packets to 
be selected by the ingress scheduler. In one embodiment, when one packet is selected, 
the entire packet is moved across the switch fabric. Once the ingress scheduler 
selects a packet of a given priority, the packet is transmitted before another packet is 
selected. 

Data must be transferred from the ingress side of the switch to the egress 
side of the switch through the switching fabric. In an n^2 mesh, n ingress sources can 
potentially contend for a single egress destination. The switch is required to transfer 
the data from ingress to egress such that the contracts of the individual data flows are 
honored. The contracts specify bandwidth, latency, jitter, loss and burst tolerance. All 
contracts must be honored under all traffic contention conditions. 

In one embodiment, the ingress scheduler provides all cell/packet discard 
functionality. Cell/packet discard can include, for example, early packet discard 
(EPD), partial packet discard (PPD), and random early discard (RED), each of which is 
known in the art. Additional and/or different cell/packet discard procedures can also be 
used. 

In one embodiment, the ingress schedulers independently schedule packets 
and cells to the egress side. To allow this independent scheduling to function, n 
separate buffers are provided on each egress card, one per ingress card. These independent cache 
buffers have sufficient bandwidth to allow simultaneous egress-side arrivals from all 
ingress devices. In one embodiment, the egress side sends "backpressure" messages to 
the ingress side to control access to the n independent cache buffers. Data in the n buffers is 
transferred to a larger egress buffer, from where it is scheduled to the egress ports. 
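
The interaction between the per-ingress cache buffers and backpressure can be sketched as follows. This Python example is a simplified model under stated assumptions: backpressure is asserted when a cache reaches a fixed depth, and the drain step moves one item per cache per pass. Neither the threshold policy nor the names `EgressCard`, `cache_depth` and `drain` come from the specification.

```python
from collections import deque


class EgressCard:
    """Toy egress side with one bounded cache buffer per ingress card."""

    def __init__(self, n_ingress, cache_depth):
        self.caches = [deque() for _ in range(n_ingress)]
        self.cache_depth = cache_depth      # assumed fixed threshold
        self.egress_buffer = deque()        # larger shared egress buffer

    def backpressure(self, ingress_card):
        """True while the cache for this ingress card cannot accept more data."""
        return len(self.caches[ingress_card]) >= self.cache_depth

    def receive(self, ingress_card, item):
        """Accept data from an ingress card unless backpressure is asserted."""
        if self.backpressure(ingress_card):
            return False
        self.caches[ingress_card].append(item)
        return True

    def drain(self):
        """Move one item from each non-empty cache into the egress buffer."""
        for cache in self.caches:
            if cache:
                self.egress_buffer.append(cache.popleft())


egress = EgressCard(n_ingress=4, cache_depth=2)
results = [egress.receive(1, f"pkt{i}") for i in range(3)]
print(results)                       # [True, True, False]
egress.drain()
print(egress.receive(1, "pkt2"))     # True
print(list(egress.egress_buffer))    # ['pkt0']
```

Because each ingress card has its own cache, a full cache for ingress card 1 does not hold off transfers from the other ingress cards.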

The egress buffer receives data from the ports of the egress card in a 
predetermined manner. For example, data can be extracted from the ports in a round 
robin fashion, or data can be extracted from the ports on a priority basis. A 
combination of round robin and priority-based extraction can also be used. 

Data received from the egress card ports is stored in the egress buffer. In 
one embodiment, the egress buffer includes a cache for each link (e.g., link between 
ingress card 2 and egress card 2, link between ingress card 3 and egress card 2). Each 
cache includes a queue for each class of data. By including a queue for each class of 
data, the egress buffer can provide quality of service functionality. 

The egress scheduler extracts data from the egress buffer and transmits the 
data according to the appropriate network protocol to an external device (not shown in 
Figure 2). In one embodiment, the egress scheduler extracts data from the egress 
buffer based on priority to provide quality of service functionality. In alternate 
embodiments, the egress scheduler can extract data from the egress buffer using 
earliest deadline first scheduling. 

Figure 3 conceptually illustrates one embodiment of an ingress scheduler. 
In one embodiment, incoming data is categorized into one of three ingress traffic 
categories (ITCs). Having only three ITCs simplifies the architecture of the ingress 
scheduler. One embodiment of a mapping of ITCs to ATM and IP traffic is set forth in 
the following table. Other types of network traffic can be mapped to the three ITCs in 
a similar manner. 

ITC                           ATM                            IP
----------------------------  -----------------------------  ------------------------------------
Real Time (RT)                CBR, VBR-RT                    IntServ Guaranteed Services,
                                                             DiffServ Expedited Forwarding,
                                                             DiffServ Network Control Traffic
Multicast (MC)                All ATM multicast connections  All IP multicast flows
Non Real Time (NRT)  Class 0  GFR, VBR-NRT                   IntServ Controlled Load Services,
                                                             DiffServ Assured Forwarding class 1
                     Class 1  UBR+, ABR with MCR>0           DiffServ Assured Forwarding class 2
                     Class 2  ABR with MCR=0                 DiffServ Assured Forwarding class 3
                     Class 3  UBR                            DiffServ Assured Forwarding class 4,
                                                             Best Effort

Ingress Traffic Category Mapping 
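
The ATM column of the mapping can be expressed as a lookup table. The Python sketch below is illustrative only; the dictionary name, the string keys and the `classify_atm` helper are assumptions, and the entry for ABR with a zero minimum cell rate reflects a reading of the table rather than a confirmed value.

```python
# Hypothetical lookup from ATM service category to (ITC, NRT class).
# The NRT class is None for categories that have no class subdivision.
ATM_TO_ITC = {
    "CBR": ("RT", None),
    "VBR-RT": ("RT", None),
    "GFR": ("NRT", 0),
    "VBR-NRT": ("NRT", 0),
    "UBR+": ("NRT", 1),
    "ABR with MCR>0": ("NRT", 1),
    "ABR with MCR=0": ("NRT", 2),   # assumed reading of the table entry
    "UBR": ("NRT", 3),
}


def classify_atm(service_category, multicast=False):
    """Return (ITC, NRT class) for an ATM connection."""
    if multicast:
        return ("MC", None)          # all ATM multicast connections map to MC
    return ATM_TO_ITC[service_category]


print(classify_atm("CBR"))                    # ('RT', None)
print(classify_atm("UBR"))                    # ('NRT', 3)
print(classify_atm("UBR", multicast=True))    # ('MC', None)
```

An analogous table could be written for the IP column, keyed on IntServ service type or DiffServ per-hop behavior.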

In one embodiment, servicing of the three ITCs is accomplished according 
to the following priority: 1) RT, 2) MC, and 3) NRT, assuming no backpressure signals 
are active. If backpressure signals are active, the transmission of the corresponding 
category of data is stopped to avoid egress port buffer overflow. Lower priority data 
can be transmitted when the backpressure signal is active for higher priority data. 
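
The servicing order with backpressure can be sketched as a simple selection function. This Python example is a minimal illustration; the function name `select_category` and the dictionary-based representation of queues and backpressure signals are assumptions.

```python
from collections import deque

PRIORITY = ("RT", "MC", "NRT")   # servicing order when no backpressure is active


def select_category(queues, backpressure):
    """Return the highest-priority category that has data and is not held off."""
    for itc in PRIORITY:
        if queues[itc] and not backpressure.get(itc, False):
            return itc
    return None


queues = {"RT": deque(["r1"]), "MC": deque(), "NRT": deque(["n1"])}
print(select_category(queues, {}))                            # RT
print(select_category(queues, {"RT": True}))                  # NRT
print(select_category(queues, {"RT": True, "NRT": True}))     # None
```

In the second call, real time data is held off by backpressure and the multicast queue is empty, so non-real time data is selected even though it has the lowest priority.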

Ingress router 300 reads data from the ingress buffer (not shown in Figure 
3) and sends the data to the appropriate queue based on the ITC mapping described 
above. Real time data is sent to one of the RT group queues (e.g., 305, 310), multicast 
data is sent to MC queue 320, and non-real time data is sent to one of the NRT queues 
(e.g., 330, 335). In one embodiment, the ingress scheduler includes 16 RT queues; 
however, any number of RT queues can be provided. 

In one embodiment, data is read out of the RT group queues in a round 
robin fashion. In another embodiment, the data is selected from the multiple RT group 
queues on a priority basis. Similarly, in one embodiment, data is read out of the NRT 
class queues in a round robin fashion. In another embodiment, the data is selected from 
the multiple NRT class queues in a weighted round robin fashion. Data from the three 
categories of data (RT, MC, NRT) is selected on a priority basis to be sent to the 
appropriate ingress port 390. 
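
Weighted round robin selection among the NRT class queues can be sketched as follows. This Python example is illustrative only; the specification does not give weights, so the values used here, along with the function name `weighted_round_robin`, are assumptions.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Yield items, visiting each queue up to `weight` times per round."""
    while any(queues):                       # a non-empty deque is truthy
        for queue, weight in zip(queues, weights):
            for _ in range(weight):
                if not queue:
                    break                    # queue drained; move to the next one
                yield queue.popleft()


# Four NRT class queues (Class 0 to Class 3); the weights are hypothetical.
nrt = [deque(["a0", "a1", "a2"]), deque(["b0", "b1"]), deque(["c0"]), deque()]
order = list(weighted_round_robin(nrt, weights=[2, 1, 1, 1]))
print(order)   # ['a0', 'a1', 'b0', 'c0', 'a2', 'b1']
```

Setting every weight to 1 reduces the sketch to plain round robin.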

Data flow control is described in greater detail in U.S. Patent application 
number 09/812,985 (Atty. Docket No. P017) filed March 19, 2001, entitled 
"METHOD AND SYSTEM FOR SWITCH FABRIC FLOW CONTROL," which is 
assigned to the corporate assignee of the present U.S. Patent application and 
incorporated by reference herein. 

Figure 4 conceptually illustrates one embodiment of egress cache 
scheduling of egress card ports. In one embodiment, each egress card port has three 
associated FIFO buffers for real time data, multicast data and non-real time data, 
respectively. 

In one embodiment, when data is received by egress port 400, the data is 
sent to one of three queues. The queues correspond to the ITCs. Real time data is 
stored in real time queue 410, multicast data is stored in multicast queue 420, and 
non-real time data is stored in non-real time queue 430. Data is read out of the queues in a 
round robin fashion. The queue from which data is transmitted is selected on a priority 
basis. 

In the foregoing specification, the invention has been described with 
reference to specific embodiments thereof. It will, however, be evident that various 
modifications and changes can be made thereto without departing from the broader 
spirit and scope of the invention. The specification and drawings are, accordingly, to 
be regarded in an illustrative rather than a restrictive sense. 

CLAIMS 

What is claimed is: 

1. A network switch having an asynchronous mesh to transfer data from 
ingress interfaces to egress interfaces, the ingress interfaces to receive data from 
external sources and to selectively transmit the data across the asynchronous mesh to 
the egress interfaces, the egress interfaces to receive data from the asynchronous mesh 
and to transmit the data to external destinations. 

2. The network switch of claim 1 wherein the ingress interfaces schedule 
respective data transmissions across the mesh and the egress interfaces schedule 
respective data transmissions to the external destinations. 

3. The network switch of claim 2 comprising N ingress interfaces, each of 
the egress interfaces further comprising N independent cache buffers coupled to N 
respective ingress interfaces to receive data from the respective N ingress interfaces. 

4. The network switch of claim 2 comprising N ingress interfaces, each of 
the N ingress interfaces having N independent cache buffers, each of the N independent 
cache buffers coupled to one of N respective egress interfaces. 

5. The network switch of claim 2 wherein one or more of the N ingress 
interfaces segregates incoming data into queues based on one or more of: a flow 
identifier, a user identifier, a session identifier, a quality of service (QoS), a priority, a 
deadline, and a service class. 

6. The network switch of claim 3 in which the egress interfaces generate a 
flow control signal to prevent access to one or more of the N buffers of the respective 
egress interfaces. 

7. The network switch of claim 3 wherein the egress interfaces generate a 
flow control signal to prevent transmission to one or more of the N buffers of the 
respective egress interfaces. 

8. The network switch of claim 3 wherein the N ingress interfaces transfer 
data to a shared egress buffer and further wherein the egress interfaces schedule and 
retrieve the data stored in the shared egress buffer prior to transmitting the data to the 
external destinations. 

9. The network switch of claim 5 in which the egress interfaces generate a 
flow control signal to prevent access by one or more of the queues at the ingress 
interfaces to the egress buffer. 

10. The network switch of claim 3 in which the N ingress interfaces 
concurrently transmit fixed-length cells and variable-length packets across the mesh to 
the egress interfaces. 

11. A network switch comprising: 

a plurality of ingress cards, the plurality of ingress cards having an ingress 
buffer to temporarily store data, an ingress scheduler coupled to the ingress buffer, and 
a plurality of ports coupled to the ingress scheduler, the ingress scheduler to read data 
from the ingress buffer and to selectively transfer the data to one of the plurality of 
ports; and 

a plurality of egress cards, the plurality of egress cards having a plurality of ports 
coupled to receive data from respective ingress card ports, an egress buffer coupled to 
the plurality of ports, the egress buffer to selectively read data from the plurality of 
ports and to store the data, and an egress scheduler coupled to the egress buffer, the 
egress scheduler to read data from the egress buffer and to transmit data from the 
egress card. 

12. The network switch of claim 11 wherein the ingress scheduler transfers 
data from the ingress buffer in a first in/first out (FIFO) manner. 

13. The network switch of claim 11 wherein the ingress scheduler transfers 
data from the ingress buffer according to priorities associated with the data. 

14. The network switch of claim 11 wherein the ingress buffer receives 
telecommunications data. 

15. The network switch of claim 11 wherein the plurality of ports of the 
egress cards further comprise one or more buffers to temporarily store data received 
from the respective ingress cards, and further wherein if a buffer of a port of an egress 
card is full the buffer and the port refuse data transmitted from the associated ingress 
card. 

16. The network switch of claim 15 wherein the associated ingress card 
from which data was refused retransmits the refused data until the associated egress 
port and buffer accept the previously refused data. 

17. The network switch of claim 11 wherein the egress buffers include a 
data store for each of the plurality of ports of the egress card. 

18. The network switch of claim 17 wherein the data store for each of the 
plurality of ports of the egress card stores the data according to an associated class. 

19. A network switch comprising: 

N ingress cards coupled to receive data from external sources, the N ingress 
cards having a plurality of ports to transmit data, wherein each of the N ingress cards 
comprises an ingress scheduler coupled to the ports of the ingress card, the ingress 
scheduler to cause data to be selectively and asynchronously transmitted via the ports 
of the ingress card; and 

M egress cards having ports coupled to receive data from one or more of the 
plurality of ports of the N ingress cards, the egress cards coupled to transmit data to 
external destinations, wherein each of the M egress cards comprises an egress 
scheduler coupled to the ports of the egress card, the egress scheduler to cause data to 
be selectively transmitted to the external destinations. 

20. The network switch of claim 19 wherein N and M are equal. 

21. The network switch of claim 19 wherein one or more of the ingress 
interfaces segregates incoming data into queues based on one or more of: a flow 
identifier, a user identifier, a session identifier, a quality of service (QoS), a priority, a 
deadline, and a service class. 

NEW ABSTRACT 

The network switch (110) described herein provides a cell/packet switching architecture (Fig. 1) that switches between line 
interface cards across a meshed backplane. In one embodiment, the switching can be accomplished at, or near, line speed in a 
protocol independent manner. The protocol independent switching provides support for various applications including 
Asynchronous Transfer Mode (ATM) switching, Internet Protocol (IP) switching, Multiprotocol Label Switching (MPLS) 
switching, Ethernet switching and frame relay switching. The architecture allows the network switch (110) to provision service on 
a per port basis. In one embodiment, the network switch (110) provides a non-blocking topology with both input and output 
queuing and per flow queuing at both ingress and egress. Per flow flow-control can be provided between ingress and egress 
scheduling. Strict priority, round robin, weighted round robin and earliest deadline first scheduling can be provided. 